The figure on the left-hand side shows the percentage of diabetic patients among all kidney-failure cases.
According to the National Kidney Foundation (NKF), an average of 5.7 patients are diagnosed with kidney failure in Singapore every day, making this a critical public health challenge for society.
Beyond the healthcare impact, a large kidney-failure population is also a heavy financial burden for the country, with $190 million spent annually on dialysis treatment.
Moreover, up to 1 million diabetic patients are predicted to be diagnosed by 2050.
We propose a random forest model as a prototype tool for our target users - laboratory diagnosticians and doctors - to decide whether a person is suffering from renal failure, as well as the stage of the illness. Once the professional enters the required health measurements, the result is displayed on screen immediately. If put into use, many people could benefit from our application, not to mention those who have not yet been diagnosed with renal failure.
Before using the data for exploratory data analysis or model training, we first need to address the issues visible in the data on the left.
We have a lot of missing data, especially in the rbc, rc, and wc columns. Since at least 25% of the values in these variables are missing, it is likely better to leave these variables out altogether.
Furthermore, we also have missing age data. Since age appears to be independent of most other variables (i.e., not meaningfully correlated with them), it is also safer to remove entries whose age is missing - one cannot predict a person's age from the data provided.
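The report's own analysis is in R, but the two cleaning rules above (drop variables with at least 25% missing values; drop rows missing age) can be sketched in pandas. The `drop_sparse` helper and the toy frame below are illustrative, not the real dataset.

```python
import numpy as np
import pandas as pd

def drop_sparse(df, col_threshold=0.25, required=("age",)):
    """Drop columns with >= col_threshold fraction missing, then drop
    rows that are missing any of the required columns (here: age)."""
    frac_missing = df.isna().mean()
    keep = frac_missing[frac_missing < col_threshold].index
    keep = keep.union(pd.Index(list(required)))  # always keep required cols
    out = df[df.columns.intersection(keep)]      # preserves column order
    return out.dropna(subset=[c for c in required if c in out.columns])

# Toy frame: rbc is mostly missing (column dropped), one row lacks age (row dropped).
df = pd.DataFrame({
    "age": [48, np.nan, 62, 48],
    "rbc": [np.nan, np.nan, "normal", np.nan],
    "bp":  [80, 50, 80, 70],
})
clean = drop_sparse(df)
```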
mi for model-based imputations

We note that it would be unwise to remove further data points, as there are only 400 in the entire dataset. At the same time, most variables could not be imputed in more "conventional" ways (e.g., hot/cold-deck imputation or mean/median imputation); this was especially true for the categorical variables. Hence, we decided to use the mi package from CRAN, which imputes missing values via chained regressions (for numerical data) and then uses those results to predict the class of a data point (for categorical data).
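The R package mi is what the report actually uses; as an analogous sketch of chained-regression imputation for numeric columns, scikit-learn's `IterativeImputer` models each feature with missing values on the others, iteratively. The synthetic data below (column 2 a noisy linear function of column 0) is purely illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic numeric data where column 2 is predictable from column 0,
# so the chained regressions have a real relationship to recover.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

mask = rng.random(200) < 0.2          # knock out ~20% of column 2
X_missing = X.copy()
X_missing[mask, 2] = np.nan

# Each iteration regresses the incomplete column on the complete ones
# and refines the filled-in values.
imputer = IterativeImputer(max_iter=20, random_state=0)
X_filled = imputer.fit_transform(X_missing)
```

Because the missing column is nearly linear in an observed one, the imputed values land close to the held-out truth, which is the situation where model-based imputation beats mean/median filling.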
The heatmap to the left shows where missing data is present (the dark spots).
Based on trial and error, we ultimately settled on 80 iterations for each regression chain. While the means of some imputed values differed somewhat between chains, 80 iterations gave the best result: only one variable had highly variable means across chains.
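The stability check described above - comparing the mean of each variable's imputed values across chains and flagging the ones that disagree - can be expressed as a small generic helper. The variable names and numbers below are hypothetical, not the report's actual diagnostics (mi produces its own convergence plots in R).

```python
import numpy as np

def unstable_variables(chain_means, tol=0.05):
    """chain_means maps variable name -> mean imputed value per chain.
    Flag variables whose relative spread across chains exceeds tol."""
    flagged = []
    for var, means in chain_means.items():
        means = np.asarray(means, dtype=float)
        spread = means.std() / (abs(means.mean()) + 1e-9)
        if spread > tol:
            flagged.append(var)
    return flagged

# Hypothetical per-chain means after 80 iterations: one stable
# variable, one whose chains disagree badly.
chain_means = {
    "hemo": [12.1, 12.2, 12.1, 12.2],   # stable across chains
    "pot":  [4.1, 6.8, 3.2, 5.5],       # highly variable -> flag
}
bad = unstable_variables(chain_means)
```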
mi also generates diagnostic plots for each variable that has missing data. Shown is an example: the histogram to the left compares the distributions of the observed, imputed, and completed data against the values predicted by mi's regression chain (the red line).
The middle graphs show that the values mi predicts for missing data points are similar to the non-missing data points.
The rightmost graphs are residual plots, suggesting that the values predicted by mi are still within reason.
We see that healthy individuals are indeed in good health - good appetite and none of the mentioned co-morbidities.
We can exploit this feature: since no healthy individual has hypertension or any other co-morbidity, anyone presenting a co-morbidity can immediately be flagged as diseased!
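That observation turns into a one-line derived feature. The sketch below (pandas rather than the report's R) builds an `any_comorbidity` flag from the dataset's yes/no co-morbidity columns (htn, dm, cad, pe, ane); the helper name and toy frame are illustrative.

```python
import pandas as pd

# Co-morbidity columns from the dataset's schema.
COMORBIDITIES = ["htn", "dm", "cad", "pe", "ane"]

def add_comorbidity_flag(df):
    """True when a person has at least one co-morbidity; by the
    observation above, such a person cannot be in the Healthy class."""
    out = df.copy()
    out["any_comorbidity"] = (out[COMORBIDITIES] == "yes").any(axis=1)
    return out

patients = pd.DataFrame({
    "htn": ["yes", "no"],
    "dm":  ["no", "no"],
    "cad": ["no", "no"],
    "pe":  ["no", "no"],
    "ane": ["no", "no"],
})
flagged = add_comorbidity_flag(patients)
```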
Out of curiosity, we also engineered a new feature, agerange, that we thought could be useful for model training. We defined three age categories as follows (based on commonly accepted standards in the US):
The number of healthy individuals appears to increase with decreasing age - could we then use this as a feature in a classifier (i.e., lower age = higher probability of being healthy)?
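Binning a numeric age into an agerange factor is a one-call operation. The sketch below uses pandas; the exact cutoffs the report adopted are not restated here, so the 18/65 boundaries and the band labels are assumed values for illustration only.

```python
import pandas as pd

def add_agerange(df, cuts=(18, 65), labels=("young", "adult", "senior")):
    """Bin age into ordered categories. Cutoffs 18/65 and the labels
    are placeholder values, not necessarily the report's definitions."""
    out = df.copy()
    out["agerange"] = pd.cut(
        out["age"],
        bins=[-float("inf"), *cuts, float("inf")],
        labels=list(labels),
    )
    return out

people = pd.DataFrame({"age": [7, 48, 68]})
banded = add_agerange(people)
```

`pd.cut` returns an ordered categorical, so the new column behaves like an R factor when handed to a tree-based model.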
spec_tbl_df [391 x 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ age : num [1:391] 48 7 62 48 51 60 68 24 52 53 ...
$ bp : num [1:391] 80 50 80 70 80 ...
$ sg : num [1:391] 1.02 1.02 1.01 1 1.01 ...
$ al : Factor w/ 6 levels "0","1","2","3",..: 2 5 3 5 3 4 1 3 4 3 ...
$ su : num [1:391] 0 0 3 0 0 0 0 4 0 0 ...
$ rbc : Factor w/ 2 levels "abnormal","normal": 2 2 2 2 2 1 1 2 2 1 ...
$ pc : Factor w/ 2 levels "abnormal","normal": 2 2 2 1 2 2 2 1 1 1 ...
$ pcc : Factor w/ 2 levels "absent","present": 1 1 1 2 1 1 1 1 2 2 ...
$ ba : Factor w/ 2 levels "absent","present": 1 1 1 1 1 1 1 1 1 1 ...
$ bgr : num [1:391] 121 171 423 117 106 ...
$ bu : num [1:391] 36 18 53 56 26 25 54 31 60 107 ...
$ sc : num [1:391] 1.2 0.8 1.8 3.8 1.4 1.1 24 1.1 1.9 7.2 ...
$ sod : num [1:391] 117 107 121 111 105 ...
$ pot : num [1:391] -4.64 0.478 -3.167 2.5 1.901 ...
$ hemo : num [1:391] 15.4 11.3 9.6 11.2 11.6 12.2 12.4 12.4 10.8 9.5 ...
$ pcv : num [1:391] 44 38 31 32 35 39 36 44 33 29 ...
$ htn : Factor w/ 2 levels "no","yes": 2 1 1 2 1 2 1 1 2 2 ...
$ dm : Factor w/ 2 levels "no","yes": 2 1 2 1 1 2 1 2 2 2 ...
$ cad : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ appet : Factor w/ 2 levels "good","poor": 1 1 2 2 1 1 1 1 1 2 ...
$ pe : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 2 1 1 ...
$ ane : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 1 1 2 2 ...
$ classification: Factor w/ 2 levels "Diseased","Healthy": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "spec")=
.. cols(
.. age = col_double(),
.. bp = col_double(),
.. sg = col_double(),
.. al = col_double(),
.. su = col_double(),
.. rbc = col_character(),
.. pc = col_character(),
.. pcc = col_character(),
.. ba = col_character(),
.. bgr = col_double(),
.. bu = col_double(),
.. sc = col_double(),
.. sod = col_double(),
.. pot = col_double(),
.. hemo = col_double(),
.. pcv = col_double(),
.. htn = col_character(),
.. dm = col_character(),
.. cad = col_character(),
.. appet = col_character(),
.. pe = col_character(),
.. ane = col_character(),
.. classification = col_character()
.. )
- attr(*, "problems")=<externalptr>
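The readr spec above reads the measurement columns as doubles and the remaining columns as character (later converted to factors). The pandas equivalent is to declare those columns categorical at read time; the inline CSV and the subset of columns below are illustrative only.

```python
import io
import pandas as pd

# Subset of the dataset's character columns, for the sketch.
CATEGORICAL = ["rbc", "pc", "htn", "classification"]

csv = io.StringIO(
    "age,rbc,pc,htn,classification\n"
    "48,normal,normal,yes,Diseased\n"
    "7,normal,normal,no,Diseased\n"
)

# dtype="category" plays the role of readr's col_character() + as.factor().
df = pd.read_csv(csv, dtype={c: "category" for c in CATEGORICAL})
```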